Title

Introduction

Methods

Data Import

if (!require(NHANES)) {
  install.packages("NHANES", quiet = TRUE)  
}
library(NHANES)
# head(NHANES)

Initial Variable Selection

Ruling out variables by reasoning or by exploratory analysis

We have chosen BMI (body mass index, weight/height^2 in kg/m^2) as our response variable. In NHANES, BMI is reported for participants aged 2 years or older, so we will focus on participants aged 2 years or older in our analysis. Provided below are all the variables in NHANES, including our response variable BMI.
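The age restriction can be applied with subset(); a minimal sketch on a toy data frame (hypothetical values, not the actual NHANES data):

```r
# Toy illustration of restricting to participants aged 2 or older;
# the real analysis would subset the NHANES data frame the same way.
toy = data.frame(Age = c(0, 1, 2, 35, 60),
                 BMI = c(NA, NA, 16.1, 27.3, 24.8))
toy_eligible = subset(toy, Age >= 2)
nrow(toy_eligible)  # 3 rows remain
```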

sort(names(NHANES)) # alphabetic order
##  [1] "Age"              "Age1stBaby"       "AgeDecade"        "AgeFirstMarij"   
##  [5] "AgeMonths"        "AgeRegMarij"      "Alcohol12PlusYr"  "AlcoholDay"      
##  [9] "AlcoholYear"      "BMI"              "BMI_WHO"          "BMICatUnder20yrs"
## [13] "BPDia1"           "BPDia2"           "BPDia3"           "BPDiaAve"        
## [17] "BPSys1"           "BPSys2"           "BPSys3"           "BPSysAve"        
## [21] "CompHrsDay"       "CompHrsDayChild"  "DaysMentHlthBad"  "DaysPhysHlthBad" 
## [25] "Depressed"        "Diabetes"         "DiabetesAge"      "DirectChol"      
## [29] "Education"        "Gender"           "HardDrugs"        "HeadCirc"        
## [33] "HealthGen"        "Height"           "HHIncome"         "HHIncomeMid"     
## [37] "HomeOwn"          "HomeRooms"        "ID"               "Length"          
## [41] "LittleInterest"   "Marijuana"        "MaritalStatus"    "nBabies"         
## [45] "nPregnancies"     "PhysActive"       "PhysActiveDays"   "Poverty"         
## [49] "PregnantNow"      "Pulse"            "Race1"            "Race3"           
## [53] "RegularMarij"     "SameSex"          "SexAge"           "SexEver"         
## [57] "SexNumPartnLife"  "SexNumPartYear"   "SexOrientation"   "SleepHrsNight"   
## [61] "SleepTrouble"     "Smoke100"         "Smoke100n"        "SmokeAge"        
## [65] "SmokeNow"         "SurveyYr"         "Testosterone"     "TotChol"         
## [69] "TVHrsDay"         "TVHrsDayChild"    "UrineFlow1"       "UrineFlow2"      
## [73] "UrineVol1"        "UrineVol2"        "Weight"           "Work"

We will record every omitted predictor, along with the reason for omitting it, in an overview table df_exclude. The variables we would like to keep will go in df_keep.

  1. Some predictors can be ruled out right away. Our response variable is BMI, so we should not use Weight or Height as predictors, because BMI is calculated directly from them (weight divided by height squared).

  2. The next group of predictors seems closely related, either by name or by logical deduction, for example Age, AgeDecade, and AgeMonths. Let's quickly double-check whether they are linearly related:

pairs(subset(NHANES, select = c('Age', 'Age1stBaby', 'AgeDecade', 'AgeMonths',
                                'AgeFirstMarij', 'AgeRegMarij')))

Age, AgeDecade and AgeMonths are clearly collinear, so we will only keep Age. Likewise, the marijuana-related variables appear collinear, so we keep only one, Marijuana, and may decide to drop it later if it is not useful. Age1stBaby is largely redundant with nPregnancies and has many NAs, so we drop it.

  3. Now let's check for collinearity among the blood-pressure-related variables:
to_test = c("BPDia1", "BPDia2", "BPDia3", "BPDiaAve", "BPSys1", "BPSys2", "BPSys3", "BPSysAve" )
pairs(subset(NHANES, select=to_test))

The blood pressure variables fall into two groups: diastolic and systolic readings. We would expect strong collinearity within each group, which is indeed the case. So, we keep only the average in each group, BPDiaAve and BPSysAve.

For the remaining collinearity assessments below, see the appendix for the corresponding pairs() plots.

  4. Next, let's check all variables related to alcohol. We again produced a pairs() plot to visualize possible collinearity (see appendix). Collinearity is not as clear in this case, but we believe one predictor related to alcohol consumption is sufficient. We will keep AlcoholYear.

  5. Let's now investigate the collinearity of the other drug-related variables. Most of these predictors are categorical, so collinearity cannot be seen in a pairs() plot, except between SmokeAge and AgeRegMarij. That relationship makes sense, as marijuana is usually smoked, so one variable can serve as a proxy for the other. In addition to Marijuana, let's keep SmokeNow and HardDrugs as proxies for drug use and its potential effect on BMI.

  6. Next, let's investigate a few lifestyle variables related to being physically active, or the opposite thereof, screen time. Because these variables are categorical, a clear picture of collinearity is not observable. Let's keep the half of these variables with the finer-grained levels for now: PhysActiveDays, TVHrsDay, and CompHrsDay.

  7. Now we should look into some other health-related variables, starting with the cholesterol- and diabetes-related predictors. DirectChol and TotChol appear to be collinear, so let's keep TotChol. Of the diabetes-related variables, we keep Diabetes.

  8. Now let's analyze the variables related to urine volume and flow. Urine volume and urine flow appear collinear, and there may also be collinearity between the first and second measurements. Let's keep UrineVol1 for now.

  9. Next up is a somewhat heterogeneous group of variables related to physical or mental health. For example, somebody who is depressed might show little interest in doing things. Again, collinearity is not easy to spot among categorical variables. Let's pick LittleInterest as a mild indicator of mental health issues (which might lead to little physical activity and obesity) and HealthGen as a general health rating.

  10. We decided to keep Poverty, which is the ratio of family income to the poverty guidelines, and drop HHIncome and HHIncomeMid, as they both capture information similar to the Poverty variable.

  11. Finally, let's add SleepHrsNight, Gender, Race1, Education, and MaritalStatus, as we believe they can have an effect on BMI and we do not suspect collinearity.

#Setting up the data frames with the variables we will be excluding and keeping for model building

df_exclude = data.frame(
  predictor = c('Weight', 'Height', 'Age1stBaby', 'AgeDecade', 'AgeMonths',
                'AgeRegMarij', 'Alcohol12PlusYr', 'AlcoholDay', 'Smoke100',
                'SmokeAge', 'RegularMarij',
                'BPDia1', 'BPDia2', 'BPDia3', 'BPSys1', 'BPSys2', 'BPSys3',
                'PhysActive', 'TVHrsDayChild', 'CompHrsDayChild',
                'DirectChol', 'DiabetesAge',
                'UrineFlow1', 'UrineVol2', 'UrineFlow2',
                'DaysPhysHlthBad', 'DaysMentHlthBad', 'Depressed', 'Race3'),
  reason_to_omit = c('linear dependence with BMI', 'linear dependence with BMI',
                     'redundant with nPregnancies and many NAs',
                     'collinear with Age', 'collinear with Age',
                     'redundant with Marijuana',
                     'more sparse than AlcoholYear', 'redundant with AlcoholYear',
                     'redundant with SmokeNow', 'collinear with AgeRegMarij',
                     'redundant with Marijuana',
                     rep('collinear with other blood pressure predictors', 6),
                     'redundant with PhysActiveDays', 'redundant with TVHrsDay',
                     'redundant with CompHrsDay', 'collinear with TotChol',
                     'redundant with Diabetes',
                     rep('collinear with UrineVol1', 3),
                     rep('redundant with HealthGen', 3),
                     'redundant with Race1'))

df_keep = data.frame(predictor = c('Age', 'AlcoholYear', 'Marijuana', 'SmokeNow', 'HardDrugs', 'BPDiaAve', 'BPSysAve', 'PhysActiveDays', 'TVHrsDay', 'CompHrsDay', 'TotChol', 'Diabetes', 'UrineVol1', 'HealthGen', 'LittleInterest', 'Poverty', 'SleepHrsNight', 'Gender', 'Race1', 'Education', 'MaritalStatus' ))
knitr::kable(df_keep, caption = "Initial Predictor Selected")
Initial Predictor Selected
predictor
Age
AlcoholYear
Marijuana
SmokeNow
HardDrugs
BPDiaAve
BPSysAve
PhysActiveDays
TVHrsDay
CompHrsDay
TotChol
Diabetes
UrineVol1
HealthGen
LittleInterest
Poverty
SleepHrsNight
Gender
Race1
Education
MaritalStatus

Let’s build a dataset nhanes_select using just the above df_keep variables.

nhanes_select = subset(NHANES, select =c(df_keep$predictor, "BMI"))

The resulting dataset, after initial variable selection above, consists of 10000 observations (rows) and 22 variables (columns) including BMI and the chosen predictors.

Convert Categorical into Factor Variables

We need to identify which variables are categorical and convert them into factor variables:

nhanes_select$Marijuana = as.factor(nhanes_select$Marijuana)
nhanes_select$SmokeNow = as.factor(nhanes_select$SmokeNow)
nhanes_select$HardDrugs = as.factor(nhanes_select$HardDrugs)
nhanes_select$Diabetes = as.factor(nhanes_select$Diabetes)
nhanes_select$TVHrsDay = as.factor(nhanes_select$TVHrsDay)
nhanes_select$CompHrsDay = as.factor(nhanes_select$CompHrsDay)
nhanes_select$HealthGen = as.factor(nhanes_select$HealthGen)
nhanes_select$LittleInterest = as.factor(nhanes_select$LittleInterest)
nhanes_select$Gender = as.factor(nhanes_select$Gender)
nhanes_select$Race1 = as.factor(nhanes_select$Race1)
nhanes_select$Education = as.factor(nhanes_select$Education)
nhanes_select$MaritalStatus = as.factor(nhanes_select$MaritalStatus)
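The column-by-column conversions above can also be done programmatically by detecting the non-numeric columns; a minimal sketch on a toy data frame (hypothetical column names):

```r
# Detect the non-numeric columns and convert them all to factors at once.
df = data.frame(Age = c(30, 40), HealthGen = c("Good", "Fair"),
                BMI = c(24.5, 28.1), stringsAsFactors = FALSE)
cat_cols = names(df)[!sapply(df, is.numeric)]
df[cat_cols] = lapply(df[cat_cols], as.factor)
sapply(df, class)  # HealthGen is now a factor
```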

Missing Values

It would be helpful to have a dataset which is devoid of NAs (missing values). First let’s get a quick idea of how many missing values are present in our initial dataset.

Which variables have mostly NA values?

library(tidyverse, quietly = TRUE)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Count the NA values in each column
na_counts = colSums(is.na(nhanes_select))

# Calculate the percentage of NA values in each column
total_rows = nrow(nhanes_select)
na_percentage = (na_counts / total_rows) * 100

# Create a data frame to store the results
na_summary = data.frame(Column = names(na_counts), NA_Count = na_counts, NA_Percentage = na_percentage)
na_summary = na_summary %>%
  arrange(desc(NA_Percentage))

# Print the summary
print(na_summary)
##                        Column NA_Count NA_Percentage
## SmokeNow             SmokeNow     6789         67.89
## PhysActiveDays PhysActiveDays     5337         53.37
## TVHrsDay             TVHrsDay     5141         51.41
## CompHrsDay         CompHrsDay     5137         51.37
## Marijuana           Marijuana     5059         50.59
## HardDrugs           HardDrugs     4235         42.35
## AlcoholYear       AlcoholYear     4078         40.78
## LittleInterest LittleInterest     3333         33.33
## Education           Education     2779         27.79
## MaritalStatus   MaritalStatus     2769         27.69
## HealthGen           HealthGen     2461         24.61
## SleepHrsNight   SleepHrsNight     2245         22.45
## TotChol               TotChol     1526         15.26
## BPDiaAve             BPDiaAve     1449         14.49
## BPSysAve             BPSysAve     1449         14.49
## UrineVol1           UrineVol1      987          9.87
## Poverty               Poverty      726          7.26
## BMI                       BMI      366          3.66
## Diabetes             Diabetes      142          1.42
## Age                       Age        0          0.00
## Gender                 Gender        0          0.00
## Race1                   Race1        0          0.00

The table above is sorted by NA percentage in descending order. The five predictors with the most NAs are SmokeNow, PhysActiveDays, TVHrsDay, CompHrsDay, and Marijuana. Half of all predictors have more than 25% missing values. If we eliminated all rows containing any missing value, we would be left with only 419 observations, too few to be meaningful. We cannot simply proceed with the data as-is, since any regression tool we use would have to drop those incomplete observations before performing its calculations. Simply eliminating the observations, although previously the standard approach, would also be inappropriate: dropping this many rows would call into question how well our models represent the underlying population, would make interpretation of our results more difficult, and could introduce selection bias. This data was also costly to produce, and we prefer not to cast it aside.
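A figure like the 419 complete rows quoted above can be computed with complete.cases(); a minimal sketch on a toy data frame:

```r
# Only rows with no NA in any column count as complete.
df = data.frame(a = c(1, NA, 3, 4),
                b = c(NA, 2, 3, 4),
                c = c(1, 2, NA, 4))
n_complete = sum(complete.cases(df))
n_complete  # 1: only the last row is fully observed
```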

Data imputation is the substitution of missing data with estimated values. Although there are simple methods that replace missing values with the mean or median of the variable in question, a more robust method is multiple imputation. Multiple imputation generates several complete datasets by drawing each missing entry from a plausible distribution modeled for it. The imputation process can use a variety of methods for computing the imputed values, depending on the distribution of the observed values and their relationship to the other variables in the observation. Once the complete datasets are generated, any analysis (such as linear regression) can be performed on each of them, and the results of each analysis are then pooled into one set of results.
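The pooling step follows Rubin's rules: the pooled estimate is the mean of the per-dataset estimates, and its variance combines within- and between-imputation variance. A minimal sketch with hypothetical coefficient estimates:

```r
# Hypothetical coefficient estimates and squared standard errors from
# m = 5 analyses of the imputed datasets.
est = c(1.02, 0.98, 1.05, 1.00, 0.95)
v   = c(0.040, 0.050, 0.045, 0.050, 0.042)
m = length(est)
qbar = mean(est)                    # pooled point estimate
ubar = mean(v)                      # within-imputation variance
b = var(est)                        # between-imputation variance
total_var = ubar + (1 + 1 / m) * b  # Rubin's total variance
c(estimate = qbar, se = sqrt(total_var))
```

In practice, mice::pool() applies these rules for us.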

We will perform the multiple imputation process with the mice package below. More information regarding the mice package can be found in the book Flexible Imputation of Missing Data.

Data Imputation with the mice Package

if (!require(mice)) {
  install.packages("mice", quiet = TRUE)  
}

Here we perform the imputation. Given the size of the data, this will take a bit of processing time. First we remove the 366 observations with no entry for BMI, to avoid imputing our response variable.

library(mice, quietly = TRUE)
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
# remove the rows which have NAs for BMI
nhanes_imp = nhanes_select[!is.na(nhanes_select$BMI), ]

# perform the multiple imputation (5 datasets)
imp = mice(nhanes_imp, seed = 420, m = 5, print = FALSE)

See the appendix for density plots comparing imputed and observed values.

Initial Model Building and Diagnostics

First we will build a complete additive model, to allow an initial diagnostic evaluation.

# perform the linear regression with each of the 5 imputed datasets
fit_add <- with(imp, lm(BMI ~  Age +  AlcoholYear + Marijuana + SmokeNow +
  HardDrugs + BPDiaAve + BPSysAve + PhysActiveDays + TVHrsDay + CompHrsDay +
  TotChol +  Diabetes +  UrineVol1 +  HealthGen +  LittleInterest + Poverty +
  SleepHrsNight +  Gender +  Race1 +  Education +  MaritalStatus))

summary(fit_add$analyses[[1]])
## 
## Call:
## lm(formula = BMI ~ Age + AlcoholYear + Marijuana + SmokeNow + 
##     HardDrugs + BPDiaAve + BPSysAve + PhysActiveDays + TVHrsDay + 
##     CompHrsDay + TotChol + Diabetes + UrineVol1 + HealthGen + 
##     LittleInterest + Poverty + SleepHrsNight + Gender + Race1 + 
##     Education + MaritalStatus)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -17.72  -4.02  -0.66   3.08  51.07 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               15.971309   0.902813   17.69  < 2e-16 ***
## Age                        0.068493   0.004923   13.91  < 2e-16 ***
## AlcoholYear               -0.007182   0.000682  -10.53  < 2e-16 ***
## MarijuanaYes              -0.110896   0.141892   -0.78  0.43450    
## SmokeNowYes               -1.844623   0.145042  -12.72  < 2e-16 ***
## HardDrugsYes              -0.535036   0.179859   -2.97  0.00294 ** 
## BPDiaAve                   0.076354   0.004811   15.87  < 2e-16 ***
## BPSysAve                   0.028586   0.004629    6.18  6.9e-10 ***
## PhysActiveDays            -0.120430   0.033572   -3.59  0.00034 ***
## TVHrsDay0_to_1_hr         -1.028120   0.441569   -2.33  0.01992 *  
## TVHrsDay1_hr              -1.466664   0.432008   -3.39  0.00069 ***
## TVHrsDay2_hr              -0.138588   0.422957   -0.33  0.74317    
## TVHrsDay3_hr               0.820815   0.432167    1.90  0.05755 .  
## TVHrsDay4_hr               0.137412   0.448163    0.31  0.75915    
## TVHrsDayMore_4_hr          0.135049   0.443332    0.30  0.76066    
## CompHrsDay0_to_1_hr        0.756341   0.186713    4.05  5.1e-05 ***
## CompHrsDay1_hr             2.080135   0.203571   10.22  < 2e-16 ***
## CompHrsDay2_hr             1.995214   0.232239    8.59  < 2e-16 ***
## CompHrsDay3_hr             2.278443   0.273593    8.33  < 2e-16 ***
## CompHrsDay4_hr             3.198275   0.376541    8.49  < 2e-16 ***
## CompHrsDayMore_4_hr        5.487766   0.315182   17.41  < 2e-16 ***
## TotChol                    0.194370   0.064065    3.03  0.00242 ** 
## DiabetesYes                2.191698   0.248486    8.82  < 2e-16 ***
## UrineVol1                  0.001975   0.000695    2.84  0.00447 ** 
## HealthGenVgood             1.522558   0.202334    7.52  5.7e-14 ***
## HealthGenGood              3.432174   0.204019   16.82  < 2e-16 ***
## HealthGenFair              4.319456   0.261867   16.49  < 2e-16 ***
## HealthGenPoor              5.911734   0.468975   12.61  < 2e-16 ***
## LittleInterestSeveral      0.312386   0.165730    1.88  0.05947 .  
## LittleInterestMost        -0.374333   0.263538   -1.42  0.15552    
## Poverty                   -0.068833   0.044977   -1.53  0.12595    
## SleepHrsNight             -0.152794   0.047284   -3.23  0.00124 ** 
## Gendermale                -0.232813   0.129759   -1.79  0.07281 .  
## Race1Hispanic             -1.352504   0.311397   -4.34  1.4e-05 ***
## Race1Mexican              -0.506982   0.279392   -1.81  0.06962 .  
## Race1White                -1.187517   0.203948   -5.82  6.0e-09 ***
## Race1Other                -2.874504   0.286073  -10.05  < 2e-16 ***
## Education9 - 11th Grade   -0.018180   0.322924   -0.06  0.95510    
## EducationHigh School       0.253344   0.310195    0.82  0.41411    
## EducationSome College      0.098083   0.310173    0.32  0.75184    
## EducationCollege Grad     -0.388556   0.329384   -1.18  0.23817    
## MaritalStatusLivePartner  -0.551224   0.314469   -1.75  0.07966 .  
## MaritalStatusMarried      -0.383803   0.244461   -1.57  0.11645    
## MaritalStatusNeverMarried -1.556719   0.272387   -5.72  1.1e-08 ***
## MaritalStatusSeparated     0.315622   0.469454    0.67  0.50140    
## MaritalStatusWidowed      -2.033354   0.377985   -5.38  7.6e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.98 on 9588 degrees of freedom
## Multiple R-squared:  0.345,  Adjusted R-squared:  0.342 
## F-statistic:  112 on 45 and 9588 DF,  p-value: <2e-16

Next, construct a data frame containing all 5 of our imputed datasets in long format, with two added columns: .imp for the imputation number, and .id for the observation number within that imputation.

imp_df = mice::complete(imp, action = "long")

Variance and Normality Assessment

Let’s do some tests on this model to identify potential issues.

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
if (!require(nortest)) {
  install.packages("nortest", quiet = TRUE)  
}
## Loading required package: nortest
# install.packages("nortest", quiet = TRUE)
library(nortest, quietly = TRUE)

### First, define some functions ###

# Function to calculate the LOOCV RMSE
calc_loocv_rmse = function(model) {
  sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}

# model diagnostics: BP and AD test p-values for each of the 5 fits
model_diagnostics = function(fit){
  fit_summary <- data.frame(bptest_p = rep(0, 5), ad_test = rep(0, 5))
  for (i in 1:5){
    fit_summary$bptest_p[i] = unname(bptest(fit$analyses[[i]])$p.value)
    fit_summary$ad_test[i] = ad.test(residuals(fit$analyses[[i]]))$p.value
  }
  knitr::kable(fit_summary, col.names = c("BP Test", "AD Test"))
}

# Cook's distance to check for influential observations
cooks_function = function(fit){
  cd = cooks.distance(fit$analyses[[1]])
  cook_thresh = 4 / length(cd)  # common heuristic threshold: 4 / n
  mean(cd > cook_thresh)
}
 
# model assessments
model_assess = function(fit){
  fit_summary <- data.frame(adj_r_squared = rep(0,5), loocv_rmse = rep(0,5))
  for (i in 1:5){
    fit_summary$adj_r_squared[i] = summary(fit$analyses[[i]])$adj
    fit_summary$loocv_rmse[i] = calc_loocv_rmse(fit$analyses[[i]]) 
  }
  knitr::kable(fit_summary, col.names = c("Adj. R-Squared", "LOOCV-RMSE"))
}
#Fitted versus Residuals Plot
plot(fitted(fit_add$analyses[[1]]), resid(fit_add$analyses[[1]]), col = "darkblue", pch = 20,
     xlab = "Fitted", ylab = "Residuals", main = "Fitted versus Residuals Plot")
abline(h=0,col = "darkorange")
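The hat-value shortcut in calc_loocv_rmse() above is exact for linear models; a quick sketch verifying it against literal leave-one-out refitting on toy data:

```r
# PRESS identity: e_i / (1 - h_i) equals the prediction error from a fit
# that leaves observation i out, so no refitting is actually needed.
set.seed(1)
d = data.frame(x = 1:20)
d$y = 2 * d$x + rnorm(20)
fit = lm(y ~ x, data = d)
shortcut = sqrt(mean((resid(fit) / (1 - hatvalues(fit)))^2))
loo_resid = sapply(1:20, function(i) {
  f = lm(y ~ x, data = d[-i, ])
  d$y[i] - predict(f, d[i, ])
})
literal = sqrt(mean(loo_resid^2))
all.equal(shortcut, literal)  # TRUE
```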

The fitted-versus-residuals plot reveals deviation from homoscedasticity: the spread of the residuals is not constant across fitted values, so the constant variance assumption looks suspect. The heavy upper residuals also suggest the normality assumption may be violated; we test both formally below.

#Normal Q-Q Plot
qqnorm(resid(fit_add$analyses[[1]]), col = "dodgerblue")
qqline(resid(fit_add$analyses[[1]]), col = "darkorange")

The Q-Q plot shows clear deviations from normality, particularly in the heavy right tail.

Let's look at the Anderson-Darling test for normality and the Breusch-Pagan test for homoscedasticity. (We use the Anderson-Darling test rather than the Shapiro-Wilk test because shapiro.test() is limited to 5000 observations.)

model_diagnostics(fit_add)
BP Test AD Test
0 0
0 0
0 0
0 0
0 0

The p-values for these tests, using each of the 5 imputed-dataset models, are all very low, essentially 0. So we reject both null hypotheses, calling into question both normality and homoscedasticity. However, both tests are sensitive to large sample sizes: with roughly 9600 observations, even small departures from the assumptions yield tiny p-values, so the tests may be less informative in this setting.
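To illustrate the large-sample sensitivity, a small sketch using base R's shapiro.test() on mildly non-normal data (a t distribution with 15 degrees of freedom, which is only slightly heavier-tailed than normal):

```r
# The same mild departure from normality that goes unnoticed at n = 50
# tends to be flagged at n = 5000, purely because of sample size.
set.seed(42)
x_small = rt(50, df = 15)
x_large = rt(5000, df = 15)
p_small = shapiro.test(x_small)$p.value
p_large = shapiro.test(x_large)$p.value
c(n50 = p_small, n5000 = p_large)  # the n = 5000 p-value tends to be far smaller
```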

Collinearity

In the summary above, many parameters had large p-values, which can be a symptom of collinearity. Let's check the predictors for variance inflation.

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
car::vif(fit_add$analyses[[1]])
##                 GVIF Df GVIF^(1/(2*Df))
## Age            3.099  1           1.760
## AlcoholYear    1.175  1           1.084
## Marijuana      1.325  1           1.151
## SmokeNow       1.414  1           1.189
## HardDrugs      1.233  1           1.110
## BPDiaAve       1.371  1           1.171
## BPSysAve       1.708  1           1.307
## PhysActiveDays 1.040  1           1.020
## TVHrsDay       1.460  6           1.032
## CompHrsDay     1.492  6           1.034
## TotChol        1.259  1           1.122
## Diabetes       1.191  1           1.091
## UrineVol1      1.064  1           1.031
## HealthGen      1.427  4           1.045
## LittleInterest 1.159  2           1.038
## Poverty        1.537  1           1.240
## SleepHrsNight  1.065  1           1.032
## Gender         1.133  1           1.064
## Race1          1.586  4           1.059
## Education      1.899  4           1.083
## MaritalStatus  2.432  5           1.093

None of the variables has a large (> 5) variance inflation factor.
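For reference, the VIF for a single numeric predictor is 1 / (1 - R^2) from regressing that predictor on all the others; a minimal sketch with toy data (hypothetical variables):

```r
# x2 is built to track x1, so x1's variance inflation factor
# should come out well above 1.
set.seed(7)
x1 = rnorm(100)
x2 = x1 + rnorm(100, sd = 0.5)  # deliberately correlated with x1
x3 = rnorm(100)
r2 = summary(lm(x1 ~ x2 + x3))$r.squared
vif_x1 = 1 / (1 - r2)
vif_x1  # inflated because x2 is collinear with x1
```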

Because of the findings above, we will fit a model using the log transformation of the response variable (BMI), and reassess the diagnostics.
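The choice of a log transform can also be motivated by a Box-Cox analysis; a minimal sketch using MASS::boxcox() on toy data with a log-linear response (the MASS package ships with R):

```r
library(MASS)
# Simulate a response that is linear on the log scale, then let
# boxcox() estimate the transformation parameter lambda.
set.seed(3)
x = rnorm(200)
y = exp(0.5 + 0.3 * x + rnorm(200, sd = 0.3))
bc = boxcox(lm(y ~ x), plotit = FALSE)
lambda_hat = bc$x[which.max(bc$y)]
lambda_hat  # near 0; lambda = 0 corresponds to the log transform
```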

# perform the linear regression with each of the 5 imputed datasets
# and the log() transform of BMI
fit_add_log <- with(imp, lm(log(BMI) ~  Age +  AlcoholYear + Marijuana + SmokeNow +
  HardDrugs + BPDiaAve + BPSysAve + PhysActiveDays + TVHrsDay + CompHrsDay +
  TotChol +  Diabetes +  UrineVol1 +  HealthGen +  LittleInterest + Poverty +
  SleepHrsNight +  Gender +  Race1 +  Education +  MaritalStatus))

summary(fit_add_log$analyses[[1]])
## 
## Call:
## lm(formula = log(BMI) ~ Age + AlcoholYear + Marijuana + SmokeNow + 
##     HardDrugs + BPDiaAve + BPSysAve + PhysActiveDays + TVHrsDay + 
##     CompHrsDay + TotChol + Diabetes + UrineVol1 + HealthGen + 
##     LittleInterest + Poverty + SleepHrsNight + Gender + Race1 + 
##     Education + MaritalStatus)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7153 -0.1488 -0.0075  0.1338  1.0414 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                2.8000078  0.0321729   87.03  < 2e-16 ***
## Age                        0.0034561  0.0001755   19.70  < 2e-16 ***
## AlcoholYear               -0.0002533  0.0000243  -10.42  < 2e-16 ***
## MarijuanaYes              -0.0037485  0.0050565   -0.74  0.45852    
## SmokeNowYes               -0.0648351  0.0051688  -12.54  < 2e-16 ***
## HardDrugsYes              -0.0173073  0.0064095   -2.70  0.00694 ** 
## BPDiaAve                   0.0029675  0.0001715   17.31  < 2e-16 ***
## BPSysAve                   0.0009464  0.0001650    5.74  9.9e-09 ***
## PhysActiveDays            -0.0048322  0.0011964   -4.04  5.4e-05 ***
## TVHrsDay0_to_1_hr         -0.0360859  0.0157359   -2.29  0.02186 *  
## TVHrsDay1_hr              -0.0547425  0.0153952   -3.56  0.00038 ***
## TVHrsDay2_hr              -0.0047818  0.0150726   -0.32  0.75106    
## TVHrsDay3_hr               0.0255240  0.0154008    1.66  0.09749 .  
## TVHrsDay4_hr               0.0031905  0.0159708    0.20  0.84166    
## TVHrsDayMore_4_hr          0.0000724  0.0157987    0.00  0.99634    
## CompHrsDay0_to_1_hr        0.0311888  0.0066537    4.69  2.8e-06 ***
## CompHrsDay1_hr             0.0802491  0.0072545   11.06  < 2e-16 ***
## CompHrsDay2_hr             0.0805117  0.0082761    9.73  < 2e-16 ***
## CompHrsDay3_hr             0.0882696  0.0097498    9.05  < 2e-16 ***
## CompHrsDay4_hr             0.1226157  0.0134185    9.14  < 2e-16 ***
## CompHrsDayMore_4_hr        0.1859430  0.0112319   16.55  < 2e-16 ***
## TotChol                    0.0087585  0.0022830    3.84  0.00013 ***
## DiabetesYes                0.0569458  0.0088551    6.43  1.3e-10 ***
## UrineVol1                  0.0001056  0.0000248    4.27  2.0e-05 ***
## HealthGenVgood             0.0599624  0.0072104    8.32  < 2e-16 ***
## HealthGenGood              0.1262192  0.0072705   17.36  < 2e-16 ***
## HealthGenFair              0.1545469  0.0093319   16.56  < 2e-16 ***
## HealthGenPoor              0.1970503  0.0167125   11.79  < 2e-16 ***
## LittleInterestSeveral      0.0092696  0.0059060    1.57  0.11656    
## LittleInterestMost        -0.0096256  0.0093915   -1.02  0.30542    
## Poverty                   -0.0028051  0.0016028   -1.75  0.08012 .  
## SleepHrsNight             -0.0058481  0.0016850   -3.47  0.00052 ***
## Gendermale                -0.0010097  0.0046241   -0.22  0.82715    
## Race1Hispanic             -0.0359625  0.0110970   -3.24  0.00120 ** 
## Race1Mexican              -0.0012320  0.0099565   -0.12  0.90153    
## Race1White                -0.0401927  0.0072679   -5.53  3.3e-08 ***
## Race1Other                -0.0984992  0.0101946   -9.66  < 2e-16 ***
## Education9 - 11th Grade   -0.0020877  0.0115078   -0.18  0.85604    
## EducationHigh School       0.0081584  0.0110542    0.74  0.46051    
## EducationSome College      0.0040085  0.0110534    0.36  0.71688    
## EducationCollege Grad     -0.0112692  0.0117380   -0.96  0.33705    
## MaritalStatusLivePartner  -0.0160915  0.0112065   -1.44  0.15106    
## MaritalStatusMarried      -0.0118043  0.0087117   -1.36  0.17545    
## MaritalStatusNeverMarried -0.0611841  0.0097069   -6.30  3.0e-10 ***
## MaritalStatusSeparated     0.0096622  0.0167296    0.58  0.56358    
## MaritalStatusWidowed      -0.0798839  0.0134700   -5.93  3.1e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.213 on 9588 degrees of freedom
## Multiple R-squared:  0.389,  Adjusted R-squared:  0.386 
## F-statistic:  136 on 45 and 9588 DF,  p-value: <2e-16
#Fitted versus Residuals Plot
par(mfrow=c(1,2))
plot(fitted(fit_add$analyses[[1]]), resid(fit_add$analyses[[1]]), col = "darkblue", pch = 20,
     xlab = "Fitted", ylab = "Residuals", main = "Fitted vs Residuals")
abline(h=0,col = "darkorange")
plot(fitted(fit_add_log$analyses[[1]]), resid(fit_add_log$analyses[[1]]), col = "darkblue", pch = 20,
     xlab = "Fitted", ylab = "Residuals", main = "Fitted vs Residuals - log(BMI)")
abline(h=0,col = "darkorange")

The model with log-transformed BMI looks much better, though still not perfect.

Now the Q-Q plot:

#Normal Q-Q Plot
par(mfrow=c(1,2))
# no transformation
qqnorm(resid(fit_add$analyses[[1]]), col = "dodgerblue")
qqline(resid(fit_add$analyses[[1]]), col = "darkorange")
# log transformation
qqnorm(resid(fit_add_log$analyses[[1]]), col = "dodgerblue")
qqline(resid(fit_add_log$analyses[[1]]), col = "darkorange")

Again, the log transformation of BMI results in a much better-looking Q-Q plot. Moving forward, we will use log-transformed BMI for our model building.

Stepwise Model Selection

Now we build and run the stepwise process to select predictors. Notice that our 5 datasets of observed and imputed data are passed to the stepwise procedure using with(), which in this case returns a mira object from the mice package.

First we will start with the additive model and perform a stepwise backward AIC model selection.

# build the stepwise workflow
scope <- list(upper = ~ Age +  AlcoholYear + Marijuana + SmokeNow +
  HardDrugs + BPDiaAve + BPSysAve + PhysActiveDays + TVHrsDay + CompHrsDay +
  TotChol +  Diabetes +  UrineVol1 +  HealthGen +  LittleInterest + Poverty +
  SleepHrsNight +  Gender +  Race1 +  Education +  MaritalStatus,
             lower = ~ 1)
expr <- expression(f1 <- lm(log(BMI) ~ 1),
                  f2 <- step(f1, scope = scope, trace = 0))
# perform the stepwise selection with each of the 5 imputed datasets
fit <- with(imp, expr)

# count the votes for variables to keep
formulas <- lapply(fit$analyses, formula)
terms <- lapply(formulas, terms)
votes <- unlist(lapply(terms, labels))
table(votes)
## votes
##            Age    AlcoholYear       BPDiaAve       BPSysAve     CompHrsDay 
##              5              5              5              5              5 
##       Diabetes      Education      HardDrugs      HealthGen LittleInterest 
##              5              4              5              5              4 
##      Marijuana  MaritalStatus PhysActiveDays        Poverty          Race1 
##              1              5              5              3              5 
##  SleepHrsNight       SmokeNow        TotChol       TVHrsDay      UrineVol1 
##              5              5              5              5              5

If we use the criterion that a variable must be selected in more than half of the datasets, we end up dropping only Marijuana. Let's also compare models using anova(), leaving out the variables with fewer than 5 votes one at a time.

# remove Education
model_without = with(imp, lm(log(BMI) ~ Age +  AlcoholYear + SmokeNow +
  HardDrugs + BPDiaAve + BPSysAve + PhysActiveDays + TVHrsDay + CompHrsDay +
  TotChol +  Diabetes +  UrineVol1 +  HealthGen +  LittleInterest + Poverty +
  SleepHrsNight +  Gender +  Race1 +  MaritalStatus))
model_with = with(imp, lm(log(BMI) ~ Age +  AlcoholYear + SmokeNow +
  HardDrugs + BPDiaAve + BPSysAve + PhysActiveDays + TVHrsDay + CompHrsDay +
  TotChol +  Diabetes +  UrineVol1 +  HealthGen +  LittleInterest + Poverty +
  SleepHrsNight +  Gender +  Race1 +  Education +  MaritalStatus))
anova(model_without, model_with)
##    test statistic df1   df2 dfcom p.value    riv
##  2 ~~ 1     1.645   4 104.1  9589  0.1686 0.4589

This p-value is not significant, so we fail to reject the null hypothesis and can discard Education.

# remove LittleInterest
model_without = with(imp, lm(log(BMI) ~ Age +  AlcoholYear + SmokeNow +
  HardDrugs + BPDiaAve + BPSysAve + PhysActiveDays + TVHrsDay + CompHrsDay +
  TotChol +  Diabetes +  UrineVol1 +  HealthGen + Poverty +
  SleepHrsNight +  Gender +  Race1 +  MaritalStatus))
model_with = with(imp, lm(log(BMI) ~ Age +  AlcoholYear + SmokeNow +
  HardDrugs + BPDiaAve + BPSysAve + PhysActiveDays + TVHrsDay + CompHrsDay +
  TotChol +  Diabetes +  UrineVol1 +  HealthGen +  LittleInterest + Poverty +
  SleepHrsNight +  Gender +  Race1 +  MaritalStatus))
anova(model_without, model_with)
##    test statistic df1   df2 dfcom p.value    riv
##  2 ~~ 1     1.862   2 18.38  9593  0.1835 0.8327

Again we fail to reject the null, so we can remove LittleInterest as well.

# remove Poverty
model_without = with(imp, lm(log(BMI) ~ Age +  AlcoholYear + SmokeNow +
  HardDrugs + BPDiaAve + BPSysAve + PhysActiveDays + TVHrsDay + CompHrsDay +
  TotChol +  Diabetes +  UrineVol1 +  HealthGen +
  SleepHrsNight +  Gender +  Race1 +  MaritalStatus))
model_with = with(imp, lm(log(BMI) ~ Age +  AlcoholYear + SmokeNow +
  HardDrugs + BPDiaAve + BPSysAve + PhysActiveDays + TVHrsDay + CompHrsDay +
  TotChol +  Diabetes +  UrineVol1 +  HealthGen + Poverty +
  SleepHrsNight +  Gender +  Race1 +  MaritalStatus))
anova(model_without, model_with)
##    test statistic df1 df2 dfcom p.value    riv
##  2 ~~ 1     5.658   1   4  9595  0.0761 0.1411

This p-value is approaching 0.05, so we should still consider dropping Poverty, but for now we will keep it.

Here is the final model from this process, which we will call fit_add_aic:

fit_add_aic = with(imp, lm(log(BMI) ~ Age +  AlcoholYear + SmokeNow +
  HardDrugs + BPDiaAve + BPSysAve + PhysActiveDays + TVHrsDay + CompHrsDay +
  TotChol +  Diabetes +  UrineVol1 +  HealthGen + Poverty +
  SleepHrsNight +  Gender +  Race1 +  MaritalStatus))
summary(fit_add_aic$analyses[[1]])
## 
## Call:
## lm(formula = log(BMI) ~ Age + AlcoholYear + SmokeNow + HardDrugs + 
##     BPDiaAve + BPSysAve + PhysActiveDays + TVHrsDay + CompHrsDay + 
##     TotChol + Diabetes + UrineVol1 + HealthGen + Poverty + SleepHrsNight + 
##     Gender + Race1 + MaritalStatus)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7075 -0.1483 -0.0081  0.1345  1.0531 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                2.7998629  0.0307142   91.16  < 2e-16 ***
## Age                        0.0034141  0.0001745   19.56  < 2e-16 ***
## AlcoholYear               -0.0002615  0.0000239  -10.93  < 2e-16 ***
## SmokeNowYes               -0.0636024  0.0051182  -12.43  < 2e-16 ***
## HardDrugsYes              -0.0175517  0.0060359   -2.91  0.00365 ** 
## BPDiaAve                   0.0029479  0.0001709   17.25  < 2e-16 ***
## BPSysAve                   0.0009912  0.0001644    6.03  1.7e-09 ***
## PhysActiveDays            -0.0050202  0.0011918   -4.21  2.5e-05 ***
## TVHrsDay0_to_1_hr         -0.0379522  0.0156881   -2.42  0.01557 *  
## TVHrsDay1_hr              -0.0561498  0.0153620   -3.66  0.00026 ***
## TVHrsDay2_hr              -0.0045574  0.0150447   -0.30  0.76196    
## TVHrsDay3_hr               0.0262886  0.0153949    1.71  0.08774 .  
## TVHrsDay4_hr               0.0046249  0.0159340    0.29  0.77163    
## TVHrsDayMore_4_hr          0.0015405  0.0157845    0.10  0.92225    
## CompHrsDay0_to_1_hr        0.0298719  0.0065714    4.55  5.5e-06 ***
## CompHrsDay1_hr             0.0796477  0.0071133   11.20  < 2e-16 ***
## CompHrsDay2_hr             0.0788360  0.0081333    9.69  < 2e-16 ***
## CompHrsDay3_hr             0.0859230  0.0096117    8.94  < 2e-16 ***
## CompHrsDay4_hr             0.1202162  0.0132678    9.06  < 2e-16 ***
## CompHrsDayMore_4_hr        0.1840787  0.0110627   16.64  < 2e-16 ***
## TotChol                    0.0090033  0.0022788    3.95  7.8e-05 ***
## DiabetesYes                0.0567895  0.0088489    6.42  1.4e-10 ***
## UrineVol1                  0.0001047  0.0000248    4.23  2.4e-05 ***
## HealthGenVgood             0.0607387  0.0071743    8.47  < 2e-16 ***
## HealthGenGood              0.1282799  0.0072059   17.80  < 2e-16 ***
## HealthGenFair              0.1565532  0.0091507   17.11  < 2e-16 ***
## HealthGenPoor              0.1966292  0.0164770   11.93  < 2e-16 ***
## Poverty                   -0.0038076  0.0015188   -2.51  0.01219 *  
## SleepHrsNight             -0.0059988  0.0016825   -3.57  0.00037 ***
## Gendermale                -0.0012345  0.0046001   -0.27  0.78843    
## Race1Hispanic             -0.0345086  0.0109893   -3.14  0.00169 ** 
## Race1Mexican              -0.0005764  0.0096893   -0.06  0.95256    
## Race1White                -0.0398583  0.0072545   -5.49  4.0e-08 ***
## Race1Other                -0.0996015  0.0100864   -9.87  < 2e-16 ***
## MaritalStatusLivePartner  -0.0172976  0.0111960   -1.54  0.12239    
## MaritalStatusMarried      -0.0124965  0.0086863   -1.44  0.15028    
## MaritalStatusNeverMarried -0.0622401  0.0097018   -6.42  1.5e-10 ***
## MaritalStatusSeparated     0.0099329  0.0167216    0.59  0.55252    
## MaritalStatusWidowed      -0.0787033  0.0133598   -5.89  4.0e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.213 on 9595 degrees of freedom
## Multiple R-squared:  0.388,  Adjusted R-squared:  0.386 
## F-statistic:  160 on 38 and 9595 DF,  p-value: <2e-16
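The summary above inspects only the first of the five imputed datasets. A sketch of how the coefficient estimates could instead be pooled across all imputations with Rubin's rules, using mice's pool() on the fit_add_aic object defined above:

```r
# Pool coefficient estimates across the 5 imputed analyses (Rubin's rules)
pooled <- pool(fit_add_aic)
summary(pooled, conf.int = TRUE)
```

The pooled standard errors incorporate between-imputation variance, so they are typically somewhat larger than those from any single imputed dataset.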

Let's try forward selection using BIC, and see if we get a smaller model:

# build the stepwise workflow
scope <- list(upper = ~ Age +  AlcoholYear + Marijuana + SmokeNow +
  HardDrugs + BPDiaAve + BPSysAve + PhysActiveDays + TVHrsDay + CompHrsDay +
  TotChol +  Diabetes +  UrineVol1 +  HealthGen +  LittleInterest + Poverty +
  SleepHrsNight +  Gender +  Race1 +  Education +  MaritalStatus,
             lower = ~ 1)
expr <- expression(f1 <- lm(log(BMI) ~ 1),
                  f2 <- step(f1, scope = scope, direction = "forward", 
                             k = log(nrow(imp[["data"]])), trace = 0))
# perform the stepwise selection with each of the 5 imputed datasets
fit <- with(imp, expr)

# count the votes for variables to keep
formulas <- lapply(fit$analyses, formula)
terms <- lapply(formulas, terms)
votes <- unlist(lapply(terms, labels))
table(votes)
## votes
##            Age    AlcoholYear       BPDiaAve       BPSysAve     CompHrsDay 
##              5              5              5              5              5 
##       Diabetes      Education      HardDrugs      HealthGen LittleInterest 
##              5              4              5              5              4 
##      Marijuana  MaritalStatus PhysActiveDays        Poverty          Race1 
##              1              5              5              3              5 
##  SleepHrsNight       SmokeNow        TotChol       TVHrsDay      UrineVol1 
##              5              5              5              5              5

This yields the same votes as the prior AIC-based selection.

Lastly, let's try stepwise selection in both directions using AIC:

# build the stepwise workflow
scope <- list(upper = ~ Age +  AlcoholYear + Marijuana + SmokeNow +
  HardDrugs + BPDiaAve + BPSysAve + PhysActiveDays + TVHrsDay + CompHrsDay +
  TotChol +  Diabetes +  UrineVol1 +  HealthGen +  LittleInterest + Poverty +
  SleepHrsNight +  Gender +  Race1 +  Education +  MaritalStatus,
             lower = ~ 1)
expr <- expression(f1 <- lm(log(BMI) ~ 1),
                  f2 <- step(f1, scope = scope, direction = "both", 
                             trace = 0))
# perform the stepwise selection with each of the 5 imputed datasets
fit <- with(imp, expr)

# count the votes for variables to keep
formulas <- lapply(fit$analyses, formula)
terms <- lapply(formulas, terms)
votes <- unlist(lapply(terms, labels))
table(votes)
## votes
##            Age    AlcoholYear       BPDiaAve       BPSysAve     CompHrsDay 
##              5              5              5              5              5 
##       Diabetes      Education      HardDrugs      HealthGen LittleInterest 
##              5              4              5              5              4 
##      Marijuana  MaritalStatus PhysActiveDays        Poverty          Race1 
##              1              5              5              3              5 
##  SleepHrsNight       SmokeNow        TotChol       TVHrsDay      UrineVol1 
##              5              5              5              5              5

Still the same results. For now, our additive model will be fit_add_aic.
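A quick check on how much variance this model explains across all imputations, rather than just the first: mice provides pool.r.squared(), which pools R-squared values over the imputed analyses. A sketch, assuming fit_add_aic as defined above:

```r
# Pooled R-squared and adjusted R-squared across the 5 imputed analyses
pool.r.squared(fit_add_aic)
pool.r.squared(fit_add_aic, adjusted = TRUE)
```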

Model Diagnostics

Data Interactions

Data Transformations

Outlier Assessment

Final Model

Results

Discussion

Appendix

Variable Selection

Additional pairs() Plots for Collinearity Assessment

Alcohol related variables:

to_test = c("Alcohol12PlusYr", "AlcoholDay", "AlcoholYear")
pairs(subset(NHANES, select = to_test))

Smoking and Drug related variables:

to_test = c("SmokeNow", "Smoke100", "SmokeAge", "Marijuana", "RegularMarij", "AgeRegMarij", "HardDrugs")
pairs(subset(NHANES, select = to_test))

Lifestyle related variables:

to_test = c("PhysActive", "PhysActiveDays", "TVHrsDay", "CompHrsDay", "TVHrsDayChild", "CompHrsDayChild")
pairs(subset(NHANES, select = to_test))

Cholesterol related variables:

to_test = c("DirectChol", "TotChol", "Diabetes", "DiabetesAge")
pairs(subset(NHANES, select = to_test))

Urine related variables:

to_test = c("UrineVol1", "UrineFlow1", "UrineVol2", "UrineFlow2")
pairs(subset(NHANES, select = to_test))

Mental health related variables:

to_test = c("HealthGen", "DaysPhysHlthBad", "DaysMentHlthBad", "LittleInterest", "Depressed" )
pairs(subset(NHANES, select = to_test))

Data Imputation

# Compare the imputed variables (red) and observed (blue)
densityplot(imp)

# summary(imp)